Self-Organizing Approach for Finding Borders of DNA Coding Regions

Authors

  • Fang Wu
  • Wei-Mou Zheng
Abstract

A self-organizing approach to gene finding is proposed, based on a codon usage model for coding regions and a positional preference model for noncoding regions. The symmetry between direct and reverse coding regions is used to reduce the number of parameters. No prior training is required; parameters are estimated by iteration. Using a sliding window and the likelihood ratio, a very accurate segmentation is obtained.

PACS number(s): 87.15.Cc, 87.14.Gg, 87.10.+e

The volume of raw DNA sequence data is increasing at a phenomenal pace, providing a rich source of data to study. As a consequence, we now face the tremendous challenge of extracting information from this formidable volume of sequence data. Computational methods for reliably detecting protein-coding regions are becoming more and more important.

Genome annotation by statistical methods is based on various statistical models of genomic sequences [1, 2], one of the most popular being the inhomogeneous, three-period Markov chain model for protein-coding regions combined with an ordinary Markov model for noncoding regions. The independent random chain model can be included in this category by regarding it as a Markov chain of order 0. The codon usage model is the independent random chain model of non-overlapping triplets, and corresponds to an inhomogeneous Markov model of order 2.

Signals in a short segment are usually buried in large fluctuations. With well-chosen parameters, statistical models work as a noise filter to pick out the signals. Methods based on local inhomogeneity, e.g. position asymmetry or periodicity of period 3, suffer from fluctuations. Most current computer methods for locating genes require some prior knowledge of the sequence's statistical properties, such as the codon usage or positional preference [3, 4, 5]. That is, a sizable training set is necessary for estimating good parameters of the model in use [6, 7].
Strongly biased by the training, such models have little power to discover surprising or atypical features. Thus, it is desirable to decipher the genomic information in an objective way. Audic and Claverie [8] have proposed a method for predicting protein-coding regions that does not require learning species-specific features from an arbitrary training set. They use an ab initio iterative Markov modeling procedure to automatically partition genome sequences into direct coding, reverse coding, and noncoding segments. This is an expectation-maximization (EM) algorithm, which is useful in modeling with hidden variables, and is performed in two steps, expectation and maximization [9, 10, 11]. Such a self-organizing or adaptive approach uses all the available unannotated genomic data for its calibration.

Before introducing the model we use and describing the technical details, we explain the EM algorithm with a simple pedagogic model, which assumes that a DNA sequence written in the four letters {a, c, g, t} is generated by independent tosses of two four-sided dice. An annotation maps the DNA sequence site by site to a sequence over the two-letter alphabet {C, N} (C for coding and N for noncoding). Two sets {pa, pc, pg, pt} and {qa, qc, qg, qt} of positional nucleotide probabilities are associated with the two dice C and N, respectively. The total probability for the given DNA sequence S = s1s2 . . . to be seen under the model is the partition or likelihood function Z = ∑
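The two-dice picture can be made concrete with a small EM sketch. It is only an illustration under stated assumptions, not the procedure of the paper: the excerpt breaks off before the likelihood is written out, so the code assumes that the hidden label (C or N) is constant within fixed-length blocks and drawn independently for each block with prior weight `w`. Without some such assumption, single letters alone could not tell the two dice apart. The function names, the block length, the starting values, and the pseudocount `eps` are all hypothetical choices.

```python
import math

ALPHABET = "acgt"

def block_counts(seq, block):
    """Nucleotide counts for consecutive, non-overlapping blocks of seq."""
    return [[seq[i:i + block].count(b) for b in ALPHABET]
            for i in range(0, len(seq) - block + 1, block)]

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def reestimate(counts, weights, eps):
    """Weighted letter frequencies; eps is a small pseudocount (assumption)."""
    totals = [eps + sum(r * n[b] for r, n in zip(weights, counts))
              for b in range(len(ALPHABET))]
    norm = sum(totals)
    return [t / norm for t in totals]

def em_two_dice(seq, block=20, iters=50, eps=1e-3):
    """EM for a two-dice model with block-constant hidden labels.

    Assumes len(seq) >= block. Returns (w, p, q, post), where post[j] is
    the posterior probability that block j was generated by die C, as
    computed in the E-step of the final iteration.
    """
    counts = block_counts(seq, block)
    w = 0.5                      # prior weight of die C (arbitrary start)
    p = [0.3, 0.2, 0.2, 0.3]     # die C probabilities for a, c, g, t
    q = [0.2, 0.3, 0.3, 0.2]     # die N probabilities for a, c, g, t
    post = []
    for _ in range(iters):
        # E-step: posterior of label C for each block, given current parameters.
        post = []
        for n in counts:
            log_c = math.log(w) + sum(k * math.log(x) for k, x in zip(n, p))
            log_n = math.log(1.0 - w) + sum(k * math.log(x) for k, x in zip(n, q))
            post.append(sigmoid(log_c - log_n))
        # M-step: re-estimate the prior weight and both dice from the posteriors.
        w = min(max(sum(post) / len(post), 1e-6), 1.0 - 1e-6)
        p = reestimate(counts, post, eps)
        q = reestimate(counts, [1.0 - r for r in post], eps)
    return w, p, q, post
```

On a toy sequence whose first half is rich in a and whose second half is rich in c, for example `("aaaat" * 4) * 10 + ("ccccg" * 4) * 10`, the posteriors separate the two halves and each die converges toward the composition of one half. Which die ends up labeled C depends only on the starting values, since the model itself is symmetric under exchanging the two dice.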

Similar resources

Finding borders between coding and noncoding DNA regions by an entropic segmentation method.

We present a new computational approach to finding borders between coding and noncoding DNA. This approach has two features: (i) DNA sequences are described by a 12-letter alphabet that captures the differential base composition at each codon position, and (ii) the search for the borders is carried out by means of an entropic segmentation method which uses only the general statistical propertie...

Divergence Measures for DNA Segmentation

Entropy-based divergence measures have shown promising results in many areas of engineering and image processing. In this study, we use the Jensen-Shannon and Jensen-Rényi divergence measures for DNA segmentation. Based on these information theoretic measures and protein shape coded in DNA, we propose a new approach to the problem of finding the borders between coding and noncoding DNA regions....

Estimating the Location of Protein-Coding Regions in Numerical DNA Sequences Using a Variable-Length Window Based on the Three-Dimensional Z-Curve

In recent years, estimation of protein-coding regions in numerical deoxyribonucleic acid (DNA) sequences using signal processing tools has been a challenging issue in bioinformatics, owing to their 3-base periodicity. Several digital signal processing (DSP) tools have been applied in order to identify the task and concentrated on assigning numerical values to the symbolic DNA sequence, then app...

Segmentation of DNA into Coding and Noncoding Regions Based on Recursive Entropic Segmentation and Stop-Codon Statistics

Heterogeneous DNA sequences can be partitioned into homogeneous domains that are comprised of the four nucleotides A, C, G, and T and the stop codons. Recursively, we apply a new entropic segmentation method on DNA sequences using Jensen-Shannon and Jensen-Rényi divergences in order to find the borders between coding and noncoding DNA regions. We have chosen 12- and 18-symbol alphabets that captu...

A New Approach to Gene Prediction Using the Self-Organizing Map

In this poster we present a gene prediction approach based on the Self-Organizing Map that has the ability to automatically identify all the major patterns of content variation within a genome. The genome may then be scanned for regions displaying the same properties as one of these automatically identified models. Even using a relatively simple coding measure (codon usage), this method can pre...

Publication date: 2001